feat: support remapping for IVF_FLAT, IVF_PQ and IVF_SQ #2708

Merged
BubbleCal merged 19 commits into lancedb:main on Dec 20, 2024

Conversation

BubbleCal
Contributor

@BubbleCal BubbleCal commented Aug 8, 2024

IVF_HNSW_* indices are not supported yet.

This prepares for supporting remap for the new vector index format. HNSW remap is not supported because simply mapping the row ids could break the connectivity of the graph.
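
To illustrate the connectivity concern, here is a minimal sketch. It is not Lance's HNSW implementation: it assumes a single-layer graph stored as an adjacency map keyed by row id, and it reuses the `HashMap<u64, Option<u64>>` mapping shape that `remap` takes in this PR, reading `None` as "row deleted" and a missing key as "row unchanged" (both readings are assumptions). `Graph`, `naive_remap` and `reachable` are hypothetical names.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical single-layer graph: row id -> neighbor row ids.
type Graph = HashMap<u64, Vec<u64>>;

/// Naively rewrite row ids. Nodes mapped to `None` are dropped together
/// with every edge that touches them; nothing is re-linked afterwards.
fn naive_remap(graph: &Graph, mapping: &HashMap<u64, Option<u64>>) -> Graph {
    // Assumption: ids that are missing from the mapping are unchanged.
    let translate = |id: u64| -> Option<u64> { mapping.get(&id).copied().unwrap_or(Some(id)) };
    graph
        .iter()
        .filter_map(|(&node, neighbors)| {
            let new_node = translate(node)?;
            let new_neighbors: Vec<u64> =
                neighbors.iter().filter_map(|&n| translate(n)).collect();
            Some((new_node, new_neighbors))
        })
        .collect()
}

/// Collect every node reachable from `start` with a depth-first walk.
fn reachable(graph: &Graph, start: u64) -> BTreeSet<u64> {
    let mut seen = BTreeSet::new();
    let mut stack = vec![start];
    while let Some(node) = stack.pop() {
        if seen.insert(node) {
            if let Some(neighbors) = graph.get(&node) {
                stack.extend(neighbors.iter().copied());
            }
        }
    }
    seen
}
```

In a chain 1 - 2 - 3, deleting row 2 leaves the remapped nodes for rows 1 and 3 with no path between them, so a search entering at one can never reach the other.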

Signed-off-by: BubbleCal <[email protected]>
@github-actions github-actions bot added the enhancement New feature or request label Aug 8, 2024
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
@codecov-commenter
codecov-commenter commented Aug 8, 2024

Codecov Report

Attention: Patch coverage is 77.77778% with 76 lines in your changes missing coverage. Please review.

Project coverage is 79.00%. Comparing base (2b29487) to head (c18b4dc).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/index/vector/builder.rs | 76.29% | 1 Missing and 31 partials ⚠️ |
| rust/lance-index/src/vector/storage.rs | 64.70% | 7 Missing and 5 partials ⚠️ |
| rust/lance/src/index/vector/ivf/v2.rs | 89.15% | 8 Missing and 1 partial ⚠️ |
| rust/lance/src/index/vector/utils.rs | 64.28% | 4 Missing and 1 partial ⚠️ |
| rust/lance-file/src/v2/writer.rs | 69.23% | 0 Missing and 4 partials ⚠️ |
| rust/lance-index/src/vector.rs | 0.00% | 3 Missing ⚠️ |
| rust/lance-index/src/vector/hnsw/builder.rs | 0.00% | 2 Missing ⚠️ |
| rust/lance-index/src/vector/v3/shuffler.rs | 90.90% | 0 Missing and 2 partials ⚠️ |
| rust/lance/src/dataset/scanner.rs | 77.77% | 0 Missing and 2 partials ⚠️ |
| rust/lance/src/index/vector/ivf.rs | 85.71% | 1 Missing and 1 partial ⚠️ |

... and 3 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2708      +/-   ##
==========================================
+ Coverage   78.80%   79.00%   +0.19%     
==========================================
  Files         246      246              
  Lines       86637    86900     +263     
  Branches    86637    86900     +263     
==========================================
+ Hits        68278    68655     +377     
+ Misses      15529    15378     -151     
- Partials     2830     2867      +37     
Flag Coverage Δ
unittests 79.00% <77.77%> (+0.19%) ⬆️

@wjones127 wjones127 marked this pull request as draft August 13, 2024 18:16
@wjones127
Contributor

@BubbleCal I've marked this as draft, since I'm assuming it is not ready for review. (There are no unit tests.) Mark it as ready for review when it is ready.

Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
@github-actions github-actions bot added the python label Dec 5, 2024
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
lance_io::ReadBatchParams::RangeFull,
4096,
16,
projection,
Contributor Author

we don't need the part_id in the batch, so just skip reading it to save resources

Signed-off-by: BubbleCal <[email protected]>
@BubbleCal BubbleCal changed the title feat: support remapping vector storage and flat index feat: support remapping for IVF_FLAT, IVF_PQ and IVF_SQ Dec 10, 2024
@BubbleCal BubbleCal marked this pull request as ready for review December 10, 2024 11:24
@@ -134,6 +135,10 @@ impl IvfSubIndex for FlatIndex {
        Ok(Self {})
    }

    fn remap(&self, _: &HashMap<u64, Option<u64>>) -> Result<Self> {
        Ok(self.clone())
Contributor

nit: let's add a warning log here?

Contributor

oh wait, we should remap the sub index here, no?

Contributor Author

for v3 we need to remap both the sub index and the vector storage. The flat index doesn't contain anything, so it simply returns itself here

Contributor

the flat map is still { origin_vector: row_id }? If row ids change during compaction, don't we need to remap them?

Contributor Author

remap on a vector index (v3) consists of:

  • remapping the sub index
  • remapping the storage

for IVF_FLAT, the sub index is FLAT and the storage is FlatStorage. The FLAT sub index doesn't contain any data, so there is nothing to do here; the remapping happens on FlatStorage
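
The two-step remap described above can be sketched as follows. This is a simplified model, not the code in this PR: `FlatSubIndexSketch` and `FlatStorageSketch` are hypothetical stand-ins for the FLAT sub index and FlatStorage, vectors are plain `Vec<f32>` rather than Arrow arrays, and the mapping semantics (`Some(new)` rewrites a row id, `None` drops the row, a missing key leaves it unchanged) are an assumption.

```rust
use std::collections::HashMap;

/// Stand-in for a FLAT sub index: it holds no per-row data.
#[derive(Clone, Debug, Default, PartialEq)]
struct FlatSubIndexSketch;

impl FlatSubIndexSketch {
    /// Nothing here refers to row ids, so remapping returns an identical copy.
    fn remap(&self, _mapping: &HashMap<u64, Option<u64>>) -> Self {
        self.clone()
    }
}

/// Stand-in for flat vector storage: one row id per stored vector.
#[derive(Clone, Debug, PartialEq)]
struct FlatStorageSketch {
    row_ids: Vec<u64>,
    vectors: Vec<Vec<f32>>,
}

impl FlatStorageSketch {
    /// Rewrite row ids and drop deleted rows, keeping vectors aligned with ids.
    fn remap(&self, mapping: &HashMap<u64, Option<u64>>) -> Self {
        let mut row_ids = Vec::new();
        let mut vectors = Vec::new();
        for (&old_id, vector) in self.row_ids.iter().zip(&self.vectors) {
            let new_id = match mapping.get(&old_id) {
                Some(Some(new_id)) => *new_id, // row moved to a new id
                Some(None) => continue,        // row was deleted
                None => old_id,                // assumed: row is unchanged
            };
            row_ids.push(new_id);
            vectors.push(vector.clone());
        }
        Self { row_ids, vectors }
    }
}
```

All of the row-id bookkeeping lives in the storage type, which is why the sub index can return a clone of itself.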

Comment on lines +111 to +112
let batch = concat_batches(self.schema(), batches.iter())?;
Self::try_from_batch(batch, self.distance_type())
Contributor

I guess remap is already slow so it probably doesn't matter but it seems odd we would need to concat here.

Contributor Author

yeah, it's because try_from_batch is not trivial, e.g. for PQ storage it transposes the PQ codes
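
As a rough illustration of the transposition mentioned here, the sketch below converts PQ codes from a row-major layout (all codes of one vector stored together) to a column-major layout (all codes of one sub-vector stored together). The function name `transpose_codes`, the flat `u8` buffer and the exact layouts are assumptions for illustration, not Lance's actual storage code.

```rust
/// Transpose PQ codes from row-major (`num_rows` x `num_sub_vectors`)
/// to column-major (`num_sub_vectors` x `num_rows`).
fn transpose_codes(codes: &[u8], num_rows: usize, num_sub_vectors: usize) -> Vec<u8> {
    assert_eq!(codes.len(), num_rows * num_sub_vectors);
    let mut transposed = vec![0u8; codes.len()];
    for row in 0..num_rows {
        for sub in 0..num_sub_vectors {
            // The destination offset depends on the total number of rows.
            transposed[sub * num_rows + row] = codes[row * num_sub_vectors + sub];
        }
    }
    transposed
}
```

Because the destination offset depends on `num_rows`, transposing several small batches separately does not give the same bytes as transposing one concatenated batch. That may be part of why a single batch is built first.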

Comment on lines +257 to +286
let element_type = get_vector_element_type(dataset, column)?;
match element_type {
    DataType::Float16 | DataType::Float32 | DataType::Float64 => {
        IvfIndexBuilder::<FlatIndex, FlatQuantizer>::new(
            dataset.clone(),
            column.to_owned(),
            dataset.indices_dir().child(uuid),
            params.metric_type,
            Box::new(shuffler),
            Some(ivf_params.clone()),
            Some(()),
            (),
        )?
        .build()
        .await?;
    }
    DataType::UInt8 => {
        IvfIndexBuilder::<FlatIndex, FlatBinQuantizer>::new(
            dataset.clone(),
            column.to_owned(),
            dataset.indices_dir().child(uuid),
            params.metric_type,
            Box::new(shuffler),
            Some(ivf_params.clone()),
            Some(()),
            (),
        )?
        .build()
        .await?;
    }
Contributor

Why did this change?

Contributor Author

I noticed that many lines were doing the same thing: getting the vector data type / value type and checking it, so I added the function get_vector_element_type to do this.
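
The idea behind such a helper can be sketched as below. This is not the real `get_vector_element_type(dataset, column)`, which takes a dataset and a column name: `DataTypeSketch` is a hypothetical stand-in for Arrow's `DataType`, and `vector_element_type`, `flat_quantizer_for` and `QuantizerKind` are illustrative names. Treating vector columns as fixed-size lists is also an assumption of this sketch.

```rust
/// Hypothetical, heavily reduced stand-in for Arrow's `DataType`.
#[derive(Clone, Debug, PartialEq)]
enum DataTypeSketch {
    Float16,
    Float32,
    Float64,
    UInt8,
    Utf8,
    /// Element type and vector dimension.
    FixedSizeList(Box<DataTypeSketch>, usize),
}

/// Which flat quantizer a column would get in this sketch.
#[derive(Debug, PartialEq)]
enum QuantizerKind {
    Flat,
    FlatBin,
}

/// One place that extracts and validates the element type of a vector column.
fn vector_element_type(column_type: &DataTypeSketch) -> Result<DataTypeSketch, String> {
    match column_type {
        DataTypeSketch::FixedSizeList(element, _dim) => Ok((**element).clone()),
        other => Err(format!("expected a fixed-size list column, got {other:?}")),
    }
}

/// Callers only match on the element type, mirroring the diff above.
fn flat_quantizer_for(column_type: &DataTypeSketch) -> Result<QuantizerKind, String> {
    match vector_element_type(column_type)? {
        DataTypeSketch::Float16 | DataTypeSketch::Float32 | DataTypeSketch::Float64 => {
            Ok(QuantizerKind::Flat)
        }
        DataTypeSketch::UInt8 => Ok(QuantizerKind::FlatBin),
        other => Err(format!("unsupported vector element type: {other:?}")),
    }
}
```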

Comment on lines -417 to -429
// async fn append(&self, batches: Vec<RecordBatch>) -> Result<()> {
// IvfIndexBuilder::new(
// dataset,
// column,
// index_dir,
// distance_type,
// shuffler,
// ivf_params,
// sub_index_params,
// quantizer_params,
// )
// }

Contributor

Remove?

Contributor Author

yeah, these lines were commented out and unused, so I just removed them

Comment on lines 525 to 537
async fn write_batches(
    path: Path,
    batches: impl Iterator<Item = RecordBatch>,
    schema: Schema,
) -> Result<usize> {
    let object_store = ObjectStore::local();
    let writer = object_store.create(&path).await?;
    let mut writer = FileWriter::try_new(writer, schema, Default::default())?;
    for batch in batches {
        writer.write_batch(&batch).await?;
    }
    Ok(writer.finish().await? as usize)
}
Contributor

Doesn't have to be part of this PR but it might be nice to have this as a static method on FileWriter.

) -> Result<()> {
let index_dir = dataset.indices_dir().child(new_uuid);
let element_type = get_vector_element_type(dataset, &column)?;
match index.index_type() {
Contributor

Would it be possible to add a remap method to the VectorIndex trait instead of using a match statement here?

Contributor Author

yeah just tried, it can work!
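
A minimal sketch of the suggestion, under stated assumptions: `VectorIndexSketch`, `IvfFlatSketch` and `IvfHnswSketch` are hypothetical types, and the real `VectorIndex` trait in Lance is async and has a different signature. The point is only that each index type carries its own remap logic, so the caller no longer needs a `match` on `index.index_type()`.

```rust
use std::collections::HashMap;

type RowIdMapping = HashMap<u64, Option<u64>>;

/// Hypothetical trait: every index type knows how to remap itself.
trait VectorIndexSketch {
    fn row_ids(&self) -> Vec<u64>;
    fn remap(&self, mapping: &RowIdMapping) -> Result<Box<dyn VectorIndexSketch>, String>;
}

struct IvfFlatSketch {
    row_ids: Vec<u64>,
}

impl VectorIndexSketch for IvfFlatSketch {
    fn row_ids(&self) -> Vec<u64> {
        self.row_ids.clone()
    }

    fn remap(&self, mapping: &RowIdMapping) -> Result<Box<dyn VectorIndexSketch>, String> {
        let row_ids = self
            .row_ids
            .iter()
            .filter_map(|id| match mapping.get(id) {
                Some(new_id) => *new_id, // Some(new) rewrites, None drops the row
                None => Some(*id),       // assumed: unmapped ids are unchanged
            })
            .collect();
        Ok(Box::new(IvfFlatSketch { row_ids }))
    }
}

struct IvfHnswSketch;

impl VectorIndexSketch for IvfHnswSketch {
    fn row_ids(&self) -> Vec<u64> {
        Vec::new()
    }

    fn remap(&self, _mapping: &RowIdMapping) -> Result<Box<dyn VectorIndexSketch>, String> {
        // Mirrors the PR description: HNSW remap is not supported yet.
        Err("remap is not supported for IVF_HNSW_* yet".to_string())
    }
}
```

With this shape, the caller just invokes `index.remap(&mapping)` on a `Box<dyn VectorIndexSketch>` and propagates the error for unsupported index types.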

}
}

async fn test_remap_impl<T: ArrowPrimitiveType>(
Contributor

Does this only test the case where rows are deleted or does it also test the case where fragments are combined and row ids are changed?

Contributor Author

added

.open_vector_index(q.column.as_str(), &index.uuid.to_string())
.await?;
let mut q = q.clone();
q.metric_type = idx.metric_type();
Contributor Author

this fixes a bug where, with unindexed data, the flat search may compute distances with a different distance type than the index
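
The fix in the diff above (`q.metric_type = idx.metric_type()`) can be modeled with a small sketch. `DistanceTypeSketch`, `QuerySketch`, `distance` and `align_with_index` are hypothetical names, and the toy distance formulas are this sketch's own convention rather than Lance's definitions. The sketch only shows why the flat scan over unindexed rows should use the index's distance type rather than whatever the query carried.

```rust
/// Hypothetical distance types for illustration.
#[derive(Clone, Copy, Debug, PartialEq)]
enum DistanceTypeSketch {
    L2,
    Dot,
}

#[derive(Clone, Debug, PartialEq)]
struct QuerySketch {
    key: Vec<f32>,
    metric_type: DistanceTypeSketch,
}

/// Toy distance functions, used only to show that the two types disagree.
fn distance(a: &[f32], b: &[f32], metric: DistanceTypeSketch) -> f32 {
    match metric {
        DistanceTypeSketch::L2 => a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>(),
        DistanceTypeSketch::Dot => 1.0 - a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>(),
    }
}

/// Copy the query and override its distance type with the index's, so
/// indexed and unindexed rows are scored on the same scale.
fn align_with_index(query: &QuerySketch, index_metric: DistanceTypeSketch) -> QuerySketch {
    let mut q = query.clone();
    q.metric_type = index_metric;
    q
}
```

Without the override, results from the index and from the flat scan would be merged even though their distances were computed with different functions, so their ordering would not be comparable.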

Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
Contributor

@westonpace westonpace left a comment

The cargo bump triggered a substrait update which is causing the MSRV failure. I'll make a PR to bump our MSRV (probably the easiest fix and 1.80 has been out for six months). No strong opinion on whether you wait for that PR or just merge and break CI.

@BubbleCal BubbleCal merged commit 72ae355 into lancedb:main Dec 20, 2024
25 of 26 checks passed
Labels
enhancement New feature or request python
6 participants